    On the Distribution of Speaker Verification Scores: Generative Models for Unsupervised Calibration

    Speaker verification systems whose outputs can be interpreted as log-likelihood ratios (LLRs) allow for cost-effective decisions by comparing the system outputs to application-defined thresholds that depend only on prior information. Classifiers often produce uncalibrated scores and require additional processing to produce well-calibrated LLRs. Recently, generative score calibration models have been proposed that achieve calibration performance close to that of state-of-the-art discriminative techniques in supervised scenarios, while also allowing for unsupervised training. The effectiveness of these methods, however, strongly depends on their ability to correctly model the target and non-target score distributions. In this work we propose theoretically grounded and accurate models for characterizing the distribution of speaker verification scores. Our approach is based on tied Generalized Hyperbolic distributions and overcomes many limitations of Gaussian models. Experimental results on different NIST benchmarks, using different utterance representation front-ends and different back-end classifiers, show that our method is effective not only in supervised scenarios, but also in unsupervised tasks characterized by a very low proportion of target trials.
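    The paper's models are tied Generalized Hyperbolic distributions, which are beyond a short sketch, but the simpler two-Gaussian baseline they generalize illustrates the generative calibration idea: fit one density to target scores and one to non-target scores, then map a raw score to the log-likelihood ratio of the two densities. A minimal Python sketch of that pipeline, using synthetic scores and a tied (pooled) variance, follows; all names and data are illustrative.

        import numpy as np
        from scipy.stats import norm

        def fit_two_gaussian_calibration(tar_scores, non_scores):
            """Fit one Gaussian per class with a tied (pooled) standard deviation."""
            mu_t, mu_n = np.mean(tar_scores), np.mean(non_scores)
            pooled = np.concatenate([tar_scores - mu_t, non_scores - mu_n])
            return mu_t, mu_n, np.std(pooled)

        def calibrate(scores, mu_t, mu_n, sigma):
            """Calibrated LLR = log p(s | target) - log p(s | non-target)."""
            return norm.logpdf(scores, mu_t, sigma) - norm.logpdf(scores, mu_n, sigma)

        # Toy usage with synthetic, miscalibrated scores.
        rng = np.random.default_rng(0)
        tar = rng.normal(3.0, 1.0, 1_000)   # target-trial scores
        non = rng.normal(0.0, 1.0, 9_000)   # non-target-trial scores
        llr = calibrate(np.array([1.5]), *fit_two_gaussian_calibration(tar, non))

    With tied variances the resulting score-to-LLR mapping is linear in the raw score; the heavier-tailed family used in the paper yields more flexible mappings that better match real score distributions.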

    Fast and Memory Effective I-Vector Extraction Using a Factorized Sub-Space

    Most state-of-the-art speaker recognition systems use a compact representation of spoken utterances referred to as i-vectors. Since the "standard" i-vector extraction procedure requires large memory structures and is relatively slow, new approaches have recently been proposed that obtain either accurate solutions at the expense of increased computational load, or fast approximate solutions traded for lower memory costs. We propose a new approach that is particularly useful for applications that need to minimize their memory requirements. Our solution not only dramatically reduces the storage needed for i-vector extraction, but is also fast. Tested on the female part of the tel-tel extended NIST 2010 evaluation trials, our approach substantially improves performance with respect to the fastest but inaccurate eigen-decomposition approach, while using much less memory than any other known method.
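    As background for what is being factorized, here is a minimal sketch of standard i-vector extraction (not the paper's factorized method): given zero-order stats N_c, centered first-order stats f_c, and the total-variability matrix T, the i-vector is the MAP estimate w = (I + sum_c N_c T_c' S_c T_c)^(-1) sum_c T_c' S_c f_c, with S_c the diagonal inverse covariance of component c. The per-component R x R products are exactly the large structures that fast extractors precompute; shapes and data below are illustrative.

        import numpy as np

        def extract_ivector(T, Sigma_inv, N, F):
            """T: (C, D, R) per-component subspace; Sigma_inv: (C, D) diagonal
            inverse covariances; N: (C,) zero-order stats; F: (C, D) centered
            first-order stats. Returns the R-dimensional i-vector."""
            C, D, R = T.shape
            precision = np.eye(R)
            linear = np.zeros(R)
            for c in range(C):
                TS = T[c].T * Sigma_inv[c]       # (R, D) = T_c' S_c
                precision += N[c] * TS @ T[c]    # accumulate N_c T_c' S_c T_c
                linear += TS @ F[c]
            return np.linalg.solve(precision, linear)

        # Toy usage with small, random statistics.
        rng = np.random.default_rng(0)
        C, D, R = 8, 4, 3
        w = extract_ivector(rng.normal(size=(C, D, R)), np.ones((C, D)),
                            np.full(C, 5.0), rng.normal(size=(C, D)))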

    Generative pairwise models for speaker recognition

    This paper proposes a simple model for speaker recognition based on i-vector pairs, and analyzes its similarities and differences with respect to the state-of-the-art Probabilistic Linear Discriminant Analysis (PLDA) and Pairwise Support Vector Machine (PSVM) models. Similar to the discriminative PSVM approach, we propose a generative model of i-vector pairs, rather than the usual model of single i-vectors. The model is based on two Gaussian distributions, one for "same speaker" and one for "different speaker" i-vector pairs, and on the assumption that the i-vector pairs are independent. This independence assumption allows the distributions of the two classes to be estimated independently. The "Two-Gaussian" approach can be extended to Heavy-Tailed distributions while still allowing a fast closed-form solution for testing i-vector pairs. We show that this model is closely related to the PLDA and PSVM models, and that, tested on the female part of the tel-tel NIST SRE 2010 extended evaluation set, it achieves accuracy comparable to the other models, which are trained with different objective functions and training procedures.
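    A minimal sketch of the "Two-Gaussian" idea, under the assumption that a trial is represented by stacking its two i-vectors: fit one Gaussian to same-speaker pairs and one to different-speaker pairs, then score a trial by the log-likelihood ratio of the two densities. Data and dimensions are synthetic placeholders.

        import numpy as np
        from scipy.stats import multivariate_normal

        def fit_pair_gaussian(pairs):
            """pairs: (n, 2*d) stacked i-vector pairs from one class."""
            return pairs.mean(axis=0), np.cov(pairs, rowvar=False)

        def pair_llr(x1, x2, same_model, diff_model):
            """LLR that the two i-vectors belong to the same speaker."""
            z = np.concatenate([x1, x2])
            return (multivariate_normal.logpdf(z, *same_model)
                    - multivariate_normal.logpdf(z, *diff_model))

        # Toy usage on synthetic pairs.
        rng = np.random.default_rng(0)
        d = 5
        same_model = fit_pair_gaussian(rng.normal(size=(500, 2 * d)))
        diff_model = fit_pair_gaussian(rng.normal(size=(500, 2 * d)))
        score = pair_llr(rng.normal(size=d), rng.normal(size=d),
                         same_model, diff_model)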

    Training pairwise Support Vector Machines with large scale datasets

    We recently presented an efficient approach for training a Pairwise Support Vector Machine (PSVM) with a suitable kernel for a quite large speaker recognition task. The PSVM approach, rather than estimating one SVM model per class according to the "one versus all" discriminative paradigm, classifies pairs of examples as belonging or not to the same class. Training a PSVM with large amounts of data, however, is a memory- and computation-expensive task, because the number of training pairs grows quadratically with the number of training patterns. This paper proposes an approach that discards the training pairs that do not essentially contribute to the set of Support Vectors (SVs) of the training set. This selection of training pairs is feasible because, as we show, the number of SVs does not grow quadratically with the number of pairs, but only linearly with the number of speakers in the training set. Our approach dramatically reduces the memory and computational complexity of PSVM training, making the use of large datasets, including many speakers, possible. It has been assessed on the extended core conditions of the 2012 Speaker Recognition Evaluation. The results show that the accuracy of the trained PSVMs increases with the training set size, and that the Cprimary of a PSVM trained with a small subset of the i-vector pairs is 10-30% better than that obtained by a generative model trained on the complete set of i-vectors.
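    The abstract does not spell out the selection procedure, so the following is only an illustrative sketch of the idea it rests on: train on a manageable chunk of pairs, keep just the pairs that become support vectors, add a fresh chunk, and retrain. The pair representation and the chunking scheme are assumptions made for the example.

        import numpy as np
        from sklearn.svm import SVC

        def pair_features(x1, x2):
            """One possible symmetric pair representation (an assumption)."""
            return np.concatenate([x1 + x2, np.abs(x1 - x2)])

        def incremental_psvm(pair_chunks, label_chunks, C=1.0):
            """Train chunk by chunk, discarding pairs that are not SVs."""
            X_keep = np.empty((0, pair_chunks[0].shape[1]))
            y_keep = np.empty((0,))
            svm = None
            for X, y in zip(pair_chunks, label_chunks):
                X_train = np.vstack([X_keep, X])
                y_train = np.concatenate([y_keep, y])
                svm = SVC(kernel="linear", C=C).fit(X_train, y_train)
                X_keep = X_train[svm.support_]   # keep only the SV pairs
                y_keep = y_train[svm.support_]
            return svm

        # Toy usage on two random chunks of pair features.
        rng = np.random.default_rng(0)
        chunks = [rng.normal(size=(200, 10)) for _ in range(2)]
        labels = [rng.integers(0, 2, 200) for _ in range(2)]
        svm = incremental_psvm(chunks, labels)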

    Memory-aware i-vector extraction by means of subspace factorization

    Most state-of-the-art speaker recognition systems use i-vectors, a compact representation of spoken utterances. Since the "standard" i-vector extraction procedure requires large memory structures, we recently presented the Factorized Sub-space Estimation (FSE) approach, an efficient technique that dramatically reduces the memory needed for i-vector extraction, and is also fast and accurate compared to other proposed approaches. FSE is based on approximating the matrix T, which represents the speaker variability sub-space, by the product of appropriately designed matrices. In this work, we introduce and evaluate a further approximation of the matrices that contribute most to the memory costs of the FSE approach, showing that it is possible to obtain comparable system accuracy using less than half of the FSE memory, which corresponds to a more than 60-fold memory reduction with respect to the standard method of i-vector extraction.
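    The abstract does not detail the further approximation, but the general mechanism behind this kind of memory saving can be sketched as replacing a large matrix with the product of two thin ones, here via truncated SVD; the shapes and rank below are illustrative, not the paper's.

        import numpy as np

        def low_rank_factor(M, rank):
            """Approximate M (m x n) as A @ B with A (m x rank), B (rank x n)."""
            U, s, Vt = np.linalg.svd(M, full_matrices=False)
            return U[:, :rank] * s[:rank], Vt[:rank]

        M = np.random.default_rng(1).normal(size=(2048, 400))
        A, B = low_rank_factor(M, rank=64)
        # Storage: 2048*400 floats vs 2048*64 + 64*400 -- roughly 5x smaller.
        rel_err = np.linalg.norm(M - A @ B) / np.linalg.norm(M)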

    Large scale training of Pairwise Support Vector Machines for speaker recognition

    State-of-the-art systems for text-independent speaker recognition use as their features a compact representation of a speaker utterance, known as an "i-vector". We recently presented an efficient approach for training a Pairwise Support Vector Machine (PSVM) with a suitable kernel for i-vector pairs on a quite large speaker recognition task. Rather than estimating one SVM model per speaker, according to the "one versus all" discriminative paradigm, the PSVM approach classifies a trial, consisting of a pair of i-vectors, as belonging or not to the same speaker class. Training a PSVM with large amounts of data, however, is a memory- and computation-expensive task, because the number of training pairs grows quadratically with the number of training i-vectors. This paper demonstrates that only a very small subset of the training pairs is necessary to train the original PSVM model, and proposes two approaches that allow discarding most of the non-essential training pairs without harming the accuracy of the model. This dramatically reduces the memory and computational resources needed for training, which becomes feasible with large datasets including many speakers. We have assessed these approaches on the extended core conditions of the NIST 2012 Speaker Recognition Evaluation. Our results show that the accuracy of a PSVM trained with a sufficient number of speakers is 10-30% better than that of a PLDA model, depending on the testing conditions. Since PSVM accuracy increases with the training set size, but PSVM training does not scale well to large numbers of speakers, our selection techniques become relevant for training accurate discriminative classifiers.
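    The quadratic blow-up that motivates pair selection is easy to make concrete: n training i-vectors yield n(n-1)/2 candidate pairs, as the short computation below shows (the i-vector counts are illustrative).

        # Number of training pairs as a function of the number of i-vectors.
        for n in (1_000, 10_000, 100_000):
            print(f"{n:>7} i-vectors -> {n * (n - 1) // 2:>14,} training pairs")
        # 100k i-vectors already yield ~5e9 pairs, far beyond what a standard
        # SVM solver can hold in memory, hence the need to discard pairs.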

    Speaker recognition by means of Deep Belief Networks

    Most state-of-the-art speaker recognition systems are based on Gaussian Mixture Models (GMMs), where a speech segment is represented by a compact representation, referred to as an "identity vector" (i-vector for short), extracted by means of Factor Analysis. The main advantage of this representation is that the problem of intersession variability is deferred to a second stage, which deals with low-dimensional vectors rather than the high-dimensional space of the GMM means. In this paper, we propose to use as a pseudo-i-vector extractor a Deep Belief Network (DBN) architecture, trained with the utterances of several hundred speakers. In this approach, the DBN performs a non-linear transformation of the input features, producing the probability that an output unit is on given the input features. We model the distribution of the output units, given an utterance, by a reduced set of parameters that embed the speaker characteristics. Tested on the dataset used for training the systems for the NIST 2012 Speaker Recognition Evaluation, this approach shows promising results.
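    A hedged sketch of the pseudo-i-vector idea: push acoustic frames through a network with sigmoid output units and summarize the per-frame activation probabilities over the utterance with a few statistics that act as the speaker representation. The weights below are random placeholders and the dimensions illustrative; the paper trains a DBN on the utterances of several hundred speakers.

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def pseudo_ivector(frames, W1, b1, W2, b2):
            """frames: (n_frames, n_features). Returns the mean and standard
            deviation of the output-unit probabilities over the utterance."""
            h = sigmoid(frames @ W1 + b1)    # hidden layer
            p = sigmoid(h @ W2 + b2)         # P(output unit on | frame)
            return np.concatenate([p.mean(axis=0), p.std(axis=0)])

        # Toy usage: 300 frames of 39-dim features, random placeholder weights.
        rng = np.random.default_rng(2)
        frames = rng.normal(size=(300, 39))
        W1, b1 = 0.1 * rng.normal(size=(39, 128)), np.zeros(128)
        W2, b2 = 0.1 * rng.normal(size=(128, 64)), np.zeros(64)
        emb = pseudo_ivector(frames, W1, b1, W2, b2)   # 128-dim embedding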

    Memory and computation effective approaches for i–vector extraction

    This paper focuses on the extraction of i-vectors, a compact representation of spoken utterances used by most state-of-the-art speaker recognition systems. This work was mainly motivated by the need to reduce the memory demand of the huge data structures that are usually precomputed for fast i-vector computation. We propose a set of new approaches that allow accurate i-vector extraction while requiring less memory, and show their relations to the standard computation method introduced for eigenvoices. We analyze the time and memory resources required by these solutions, which are suited to different fields of application, and show that it is possible to obtain accurate results with solutions that reduce both computation time and memory demand compared with the standard solution.
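    A back-of-the-envelope view of the memory demand in question: the standard fast extractor precomputes the symmetric R x R matrix T_c' S_c T_c for every GMM component c, and those structures dwarf the matrix T itself. The figures below assume float32 and typical system sizes; they are illustrative, not taken from the paper.

        # Memory for the standard fast i-vector extractor (illustrative sizes).
        C, D, R = 2048, 60, 400                   # components, feature dim, i-vector dim
        T_bytes = C * D * R * 4                   # the T matrix itself
        precomp_bytes = C * R * (R + 1) // 2 * 4  # per-component T_c' S_c T_c terms
        print(f"T: {T_bytes / 2**20:.0f} MiB, "
              f"precomputed: {precomp_bytes / 2**30:.2f} GiB")
        # ~188 MiB for T but ~0.61 GiB for the precomputed structures: this is
        # the trade-off between memory and computation the paper explores.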